For this first experimental analysis, I fetched the Swedish auto insurance dataset, which consists of only two variables:
The number of claims
Total payment for all the claims in Swedish Kronor for geographical zones in Sweden
The dataset is referenced from the Swedish Committee on Analysis of Risk Premium in Motor Insurance.
I split the dataset into a training set and a testing set as follows:
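The splitting code itself is not echoed in the report; a minimal sketch of a random 80/20 split (written in Python, with an illustrative stand-in for the actual data, since the original R code is not shown) could look like this:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the split is reproducible

def train_test_split(X, y, test_frac=0.2):
    """Randomly partition (X, y) into training and testing subsets."""
    idx = rng.permutation(len(y))
    n_test = int(round(len(y) * test_frac))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Illustrative stand-in for the Swedish auto insurance data
claims_no = np.arange(60, dtype=float)
total_payment = 22.8 + 3.4 * claims_no
X_train, y_train, X_test, y_test = train_test_split(claims_no, total_payment)
```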
##
## Call:
## lm(formula = total_payment ~ claims_no, data = dd_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89.463 -23.883 0.401 22.676 81.093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.8205 6.5351 3.492 0.000999 ***
## claims_no 3.4157 0.1998 17.094 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.15 on 51 degrees of freedom
## Multiple R-squared: 0.8514, Adjusted R-squared: 0.8485
## F-statistic: 292.2 on 1 and 51 DF, p-value: < 2.2e-16
It can be seen that both the intercept and the claims number variable are significant in explaining the total payment on claims. The adjusted R-squared of 0.8485 is high, implying that approximately 85% of the variation in total payments is accounted for by the model.
The mean squared error (MSE) and root mean squared error (RMSE) metrics for the pure regression model are shown below:
## [1] "The MSE is 1629.12176172261"
## [1] "The RMSE is 40.3623805259627"
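For clarity, the two metrics are computed from the test-set predictions in the usual way; a generic sketch (variable names are illustrative, not taken from the original code):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error of the predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    """Root mean squared error: the MSE expressed in the response's units."""
    return float(np.sqrt(mse(y_true, y_pred)))

y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
error = mse(y_true, y_pred)  # (4 + 4 + 9) / 3 = 17/3
```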
From now on, we shall treat this regression model as the baseline against which the performance of the noise-added models is analyzed.
In this example, we examine the Survey of Consumer Finances (SCF), a nationally representative sample that contains extensive information on assets, liabilities, income, and demographic characteristics of those sampled (potential U.S. customers). We study a random sample of 275 households with positive incomes, interviewed in the 2004 survey, that purchased term life insurance. We wish to accurately determine the family characteristics that influence the amount of insurance purchased.
The data is split into the training and testing set as follows:
The variables of interest in our case are:
EDUCATION - Number of years of education of the survey respondent
INCOME - Annual income
NUMHH - Number of household members
FACE - The amount of insurance purchased, as measured by the policy's face value
Since the variables of interest, FACE and INCOME, are highly skewed, I applied a log transformation before the analysis; the plot describes the transformed variables.
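The transformation itself is straightforward; a sketch with illustrative values (the original analysis was done in R, and `log1p` is used here as a hedge against the zero incomes present in the data):

```python
import numpy as np

# Right-skewed illustrative values standing in for FACE and INCOME
face = np.array([20000.0, 130000.0, 600000.0, 50000.0])
income = np.array([43000.0, 84000.0, 1020000.0, 0.0])

# log1p(x) = log(1 + x) maps 0 -> 0, so zero incomes survive the
# transformation while the long right tail is compressed
log_face = np.log1p(face)
log_income = np.log1p(income)
```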
##
## Call:
## lm(formula = FACE ~ ., data = dd_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3327 -0.9315 0.0941 0.8509 4.6355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.41441 0.93808 3.640 0.000340 ***
## EDUCATION 0.18147 0.04786 3.792 0.000193 ***
## NUMHH 0.26516 0.07139 3.714 0.000258 ***
## INCOME 0.46581 0.08451 5.512 9.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.547 on 221 degrees of freedom
## Multiple R-squared: 0.2926, Adjusted R-squared: 0.283
## F-statistic: 30.47 on 3 and 221 DF, p-value: < 2.2e-16
The fitted model is shown above. Although all the variables are significant, only about 28% of the total variation in FACE value is explained by the model. This low explanatory power could be due to the small number of variables used in the model.
The mean squared error (MSE) and root mean squared error (RMSE) metrics for the pure regression model are shown below:
## [1] "The MSE is 2.09223338317749"
## [1] "The RMSE is 1.44645545495791"
From now on, we shall treat this regression model as the baseline against which the performance of the noise-added models is analyzed.
I fitted a total of 30 noise-added models, from the 1-neighbour model to the 30-neighbour model, and their performance in terms of the MSE metric is visualized below in comparison to the baseline model.
As can be seen, the noise-added models performed considerably better than the baseline regression model; one noise-added model even achieved an MSE roughly 80% lower than the baseline's, indicating a higher level of accuracy for the noise-added models.
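The report does not show the noise-adding procedure itself. Under the assumption that the k-neighbour models perturb each training point toward one of its k nearest neighbours (a common interpolation-style augmentation; the exact method used here may differ), a sketch looks like:

```python
import numpy as np

def add_knn_noise(X, y, k, rng):
    """Augment (X, y) with one synthetic point per observation, drawn on the
    segment between the point and a random one of its k nearest neighbours."""
    X = np.atleast_2d(np.asarray(X, float)).reshape(len(y), -1)
    y = np.asarray(y, float)
    # Pairwise squared distances between all training points
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    nbrs = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbours
    j = nbrs[np.arange(len(y)), rng.integers(0, k, len(y))]
    lam = rng.uniform(0.0, 1.0, (len(y), 1))  # interpolation weights
    X_new = X + lam * (X[j] - X)
    y_new = y + lam[:, 0] * (y[j] - y)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

rng = np.random.default_rng(0)
X = np.linspace(0.0, 10.0, 20)
y = 3.0 * X + rng.normal(0.0, 1.0, 20)
X_aug, y_aug = add_knn_noise(X, y, k=3, rng=rng)
slope, intercept = np.polyfit(X_aug[:, 0], y_aug, 1)  # OLS on the augmented set
```

Because each synthetic point lies between two genuine observations, the augmented fit stays close to the underlying relationship while smoothing out individual noise.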
An investor’s decision to purchase a stock is generally made with a number of criteria in mind. First, investors usually look for a high expected return. A second criterion is the riskiness of a stock, which can be measured through the variability of the returns. Third, many investors are concerned with the length of time that they are committing their capital with the purchase of a security. Many income stocks, such as utilities, regularly return portions of capital investments in the form of dividends. Other stocks, particularly growth stocks, return nothing until the sale of the security. Thus, the average length of investment in a security is another criterion. Fourth, investors are concerned with the ability to sell the stock at any time convenient to the investor. We refer to this fourth criterion as the liquidity of the stock. The more liquid the stock, the easier it is to sell. To measure liquidity, in this study, we use the number of shares traded on an exchange over a specified period of time (called the VOLUME). We are interested in studying the relationship between the volume and other financial characteristics of a stock.
We begin this analysis with 126 companies whose options were traded on December 3, 1984. The stock data were obtained from Francis Emory Fitch Inc. for the period from December 3, 1984, to February 28, 1985.
Although the data contained many more variables, the variables I chose include the following:
The three-month total trading volume (VOLUME, in millions of shares)
The three-month total number of transactions (NTRAN)
The average time between transactions (AVGT, measured in minutes)
I split the data into the training and testing set as follows:
The training set has 73 data points and the testing set has 50.
The fitted regression model is shown below:
##
## Call:
## lm(formula = VOLUME ~ NTRAN + AVGT, data = dd_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1864 -2.2956 -0.5324 1.2571 16.0270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.3181566 1.9107521 3.307 0.00149 **
## NTRAN 0.0014545 0.0001476 9.857 7.24e-15 ***
## AVGT -0.4704387 0.1923462 -2.446 0.01697 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.285 on 70 degrees of freedom
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.7978
## F-statistic: 143.1 on 2 and 70 DF, p-value: < 2.2e-16
From the output, the regression model accounts for approximately 80% of the total variability in the response variable VOLUME. We also note that all the variables are significant.
The error metrics for the regression model are:
## [1] "The MSE is: 20.0736660618088"
## [1] "The RMSE is: 4.48036450099864"
I fitted a total of 30 noise-added models, and their comparison to the benchmark regression model is visualized below:
In this experimental work, I show that we can still incorporate categorical variables in noise adding by using one-hot encoding, as shown below.
The categorical variables included in this analysis are:
GENDER - Gender of the survey respondent
MARSTAT - Marital status of the survey respondent
The data with categorical variables is as shown below:
## SEDUCATION TOTINCOME MARSTAT GENDER FACE
## 1 16 43000 MARRIED MALE 20000
## 2 8 0 MARRIED MALE 130000
## 5 12 1020000 MARRIED MALE 0
## 6 14 0 PARTNER MALE 220000
## 7 0 0 OTHER FEMALE 0
## 8 17 84000 MARRIED MALE 600000
The fitted regression model is as shown below:
##
## Call:
## lm(formula = FACE ~ ., data = dd_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2618173 -442135 -270598 -62034 13729402
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.126e+05 3.763e+05 -1.894 0.05893 .
## SEDUCATION 7.606e+04 2.361e+04 3.222 0.00137 **
## TOTINCOME 3.058e-02 1.263e-02 2.421 0.01589 *
## MARSTATOTHER 8.096e+05 3.632e+05 2.229 0.02629 *
## MARSTATPARTNER -2.356e+05 2.615e+05 -0.901 0.36799
## GENDERMALE 1.735e+05 2.337e+05 0.742 0.45820
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1311000 on 444 degrees of freedom
## Multiple R-squared: 0.06109, Adjusted R-squared: 0.05051
## F-statistic: 5.777 on 5 and 444 DF, p-value: 3.512e-05
The accuracy metrics of the regression model (which we will refer to as the baseline model) are shown below:
## [1] "The MSE is: 419574732281.392"
To obtain the noise-added models we must calculate distances between observations, and since distances cannot be calculated directly on categorical data, we employ techniques for converting categorical data into numerical data, e.g. one-hot encoding, splitting, dummy coding, etc. In this particular case, I employ one-hot encoding, and the final dataset is shown below:
## SEDUCATION TOTINCOME FACE MARRIED PARTNER OTHER MALE FEMALE
## 1 16 43000 20000 1 0 0 1 0
## 2 8 0 130000 1 0 0 1 0
## 5 12 1020000 0 1 0 0 1 0
## 6 14 0 220000 0 1 0 1 0
## 7 0 0 0 0 0 1 0 1
## 8 17 84000 600000 1 0 0 1 0
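The encoding replaces each categorical column with one 0/1 indicator column per level, exactly as in the table above; a minimal dependency-free sketch (the rows are illustrative, not the actual data):

```python
def one_hot(values):
    """Map a list of category labels to 0/1 indicator columns,
    one column per distinct level (no level is dropped)."""
    levels = sorted(set(values))
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

marstat = ["MARRIED", "MARRIED", "PARTNER", "OTHER"]
gender = ["MALE", "MALE", "MALE", "FEMALE"]

# Merge the indicator columns for both categorical variables
encoded = {**one_hot(marstat), **one_hot(gender)}
# encoded["MARRIED"] == [1, 1, 0, 0]; encoded["FEMALE"] == [0, 0, 0, 1]
```

With every level kept as its own column (as in the MARRIED/PARTNER/OTHER and MALE/FEMALE columns above), Euclidean distances between rows are well defined.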
For noise adding, I fitted thirty noise-added models, ranging from the one-neighbour model to the thirty-neighbour model, and their performance is illustrated graphically below: